Using Python to analyse data on the coronavirus

Saeed Amen - https://www.cuemacro.com - saeed@cuemacro.com

First of all I want to stress that I am not a medic. However, I am a quant, and as such I wanted to see if there are simple ways I could look at the data related to the coronavirus. In particular, I'm doing some very simple plots and analysis, but I hope you find this code useful in any case (and feel free to copy and reuse). Our data source is one compiled by Johns Hopkins, which they've made available on GitHub at https://github.com/CSSEGISandData/COVID-19 and which is being updated daily. This dataset drives a lot of the analysis that has been published on the coronavirus.

I've tried to stick to libraries which are relatively common, like Pandas, and which are part of the standard Anaconda installation. If you want an easy way to use these notebooks, I would recommend loading them on Azure Notebooks (http://notebooks.azure.com), which has a free version, and you can rerun them as new data is released. If you want to set up your Python environment similarly to mine, follow the instructions at https://github.com/cuemacro/teaching/blob/master/pythoncourse/installation/installing_anaconda_and_pycharm.ipynb - this will pretty much install all the libraries you'll ever need for data science. I've also made a public Azure notebook version of this project at https://notebooks.azure.com/saeedamen/projects/coronavirus.

Loading the data

We shall be looking at several CSV files from the GitHub site, which contain time series of the number of confirmed cases, deaths and recoveries related to COVID-19. These are updated on a regular basis. Note that we are getting the raw content from these GitHub pages.

We'll also want to download some population data from the OECD, which has this path.

Let's do some imports of libraries and creation of objects we'll need to use later.

Let's check my version of pandas. Mine is 0.24.2, but this notebook will likely work with other versions too.

Let's load up all the datasets.

Check the format of the downloaded data for the confirmed cases.

We'll write a function to make the data easier to work with, transposing the dataset and making the date the index of our dataframe. We also add labels to the columns.
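As a minimal sketch of such a function, assuming the wide layout used by the Johns Hopkins CSV files (a `Province/State`, `Country/Region`, `Lat` and `Long` column followed by one column per date), something like the following could be used. The function name `tidy_time_series`, the choice to sum provinces into one column per country, and the tiny input frame with made-up numbers are all assumptions for illustration, not the notebook's own code.

```python
import pandas as pd

# Tiny illustrative frame in the wide layout of the downloaded CSVs.
# The numbers are made up and are NOT real case counts.
raw = pd.DataFrame({
    "Province/State": [None, "Hubei", "Beijing"],
    "Country/Region": ["Italy", "China", "China"],
    "Lat": [43.0, 30.97, 40.18],
    "Long": [12.0, 112.27, 116.41],
    "1/22/20": [0, 444, 14],
    "1/23/20": [0, 444, 22],
})


def tidy_time_series(df, label):
    """Hypothetical helper: one column per country, dates as the index."""
    df = df.drop(columns=["Province/State", "Lat", "Long"])
    df = df.groupby("Country/Region").sum()  # assumption: sum provinces per country
    df = df.T                                # dates move from columns to the index
    df.index = pd.to_datetime(df.index, format="%m/%d/%y")
    df.index.name = "Date"
    # Label each column, e.g. "Italy confirmed"
    df.columns = [f"{country} {label}" for country in df.columns]
    return df


confirmed = tidy_time_series(raw, "confirmed")
print(confirmed)
```

The same helper could then be called with a different label (for example "deaths") on the other datasets.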

Let's run that function on our datasets.

We can now see that the dataset is in an easier-to-use format.

Plotting the cases data

We can try plotting confirmed cases in Italy and China.

Adjusting data by population

Obviously, it's difficult to compare these values because of the differences in population. Let's download that data from the OECD. We'll just pick up population data for 2014, the latest year available, and we'll strip down the dataframe so it only has population values. Note that this data isn't going to change often, so we can just cache it locally as a CSV.

If we look at the OECD data, it contains ISO codes rather than country names. We want to be able to convert the codes to country names, so that we can join it with our coronavirus dataset from earlier.

Let's now get the ISO codes for the various countries. We can get these from Plotly Express's built-in gapminder dataset. Martien Lubberink has suggested the pycountry library, which makes the whole process of getting ISO codes for countries much easier.

Let's have a look at the ISO code dataset.

We can replace the OECD ISO codes in the columns with the country names (with exceptions), so both datasets have matching names.

We see we have relabelled the countries and also removed, for example, EU28 and OECD, given that the coronavirus dataset doesn't have data for aggregated groupings of countries. We also removed columns which are still in code format (i.e. 3 letters long).
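The relabelling and filtering described above can be sketched as follows. This is only an illustration: the `iso_to_name` dictionary stands in for whatever lookup is built from the ISO code dataset, the `XXX` column is a made-up placeholder for a code with no match, and the aggregate values are invented (the Italy and China figures are the ones quoted in the text, in millions).

```python
import pandas as pd

# Population in millions, with columns keyed by codes as in the OECD file.
# EU28/OECD/XXX values are made up for illustration only.
population = pd.DataFrame(
    {"ITA": [60.0], "CHN": [1385.0], "EU28": [500.0], "OECD": [1200.0], "XXX": [1.0]},
    index=[2014],
)

# Hypothetical lookup from ISO alpha-3 code to country name
iso_to_name = {"ITA": "Italy", "CHN": "China"}

# Relabel the codes we can map to country names
population = population.rename(columns=iso_to_name)

# Remove aggregate groupings that the coronavirus dataset does not have
population = population.drop(columns=["EU28", "OECD"], errors="ignore")

# Remove anything still in 3-letter code format (i.e. not relabelled)
keep = [col for col in population.columns if len(col) != 3]
population = population[keep]

print(population)
```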

Our dataset now only has the population values. One slight complication is that country codes do not match our earlier datasets. The population of Italy is 60 million, whereas for China it is 1385 million. Admittedly this might be a simplistic comparison, given that most of the cases in China were in a specific area, Hubei, which has a population of 58.5 million (from https://en.wikipedia.org/wiki/Hubei). Also, for Italy most of the cases at the start of the episode have been in Northern Italy.

Let's create a copy of the confirmed data and we'll then normalize the values by population for OECD countries where we have population data.
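A minimal sketch of this normalization step, assuming the confirmed columns are labelled like "Italy confirmed" and the population is held in millions, might look like this. The case counts and the "Atlantis" column are made up to show what happens when no population figure is available. Dropping such columns is one possible choice, not necessarily what the notebook does.

```python
import pandas as pd

dates = pd.date_range("2020-03-01", periods=3, freq="D")

# Made-up cumulative case counts, for illustration only
confirmed = pd.DataFrame(
    {
        "Italy confirmed": [100, 200, 400],
        "China confirmed": [1000, 1100, 1200],
        "Atlantis confirmed": [1, 2, 3],  # hypothetical country with no population data
    },
    index=dates,
)

# Population in millions (figures quoted in the text)
population_m = pd.Series({"Italy": 60.0, "China": 1385.0})

# Work on a copy so the original counts stay intact
normalized = confirmed.copy().astype(float)

for col in confirmed.columns:
    country = col.replace(" confirmed", "")
    if country in population_m.index:
        # Cases per million inhabitants
        normalized[col] = confirmed[col] / population_m[country]
    else:
        # Assumption: drop countries where we have no population figure
        normalized = normalized.drop(columns=col)

print(normalized)
```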

If we plot the normalized figures for a few countries, the number of confirmed cases looks worse in Italy compared to China (but there are some caveats about Chinese data). Note also the different starting points. Later on, we'll create functions to adjust for the start of the outbreaks, to make the comparison more even.

Let's compare the total number of confirmed cases in Italy and the United Kingdom, which have relatively similar population levels. In this case we have shifted the UK values by 2 weeks. Thus far it seems as though the UK's confirmed cases are following the same pattern as Italy's.
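A fixed lag like this can be applied with pandas' `shift`. The sketch below uses made-up numbers and assumes the UK series is moved 14 days earlier, so that it lines up with the earlier Italian outbreak. The direction of the shift is an assumption about what was done in the original plot.

```python
import pandas as pd

dates = pd.date_range("2020-02-20", periods=30, freq="D")

# Made-up cumulative counts, for illustration only
df = pd.DataFrame(
    {
        "Italy confirmed": list(range(30)),
        "United Kingdom confirmed": list(range(0, 60, 2)),
    },
    index=dates,
)

lag_days = 14  # two weeks

shifted = df.copy()
# A negative shift moves values to earlier dates; the last 14 rows become NaN
shifted["United Kingdom confirmed"] = df["United Kingdom confirmed"].shift(-lag_days)

print(shifted.head())
```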

Let's combine the confirmed cases with the deaths data, so we can compare the data.
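Because both DataFrames share the same date index and carry different column labels, combining them can be as simple as a join. The following is a sketch with made-up numbers, assuming the column labels "Italy confirmed" and "Italy deaths".

```python
import pandas as pd

dates = pd.date_range("2020-03-01", periods=3, freq="D")

# Made-up figures, for illustration only
confirmed = pd.DataFrame({"Italy confirmed": [100, 200, 400]}, index=dates)
deaths = pd.DataFrame({"Italy deaths": [2, 5, 11]}, index=dates)

# Outer join on the date index keeps every date present in either dataset
combined = confirmed.join(deaths, how="outer")

print(combined)
```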

Let's plot it for Italy.

Adjusting confirmed cases by deaths

Next, we want to calculate deaths as a percentage of confirmed cases. Note that for a large part of the sample this will be undefined for Italy (when there were no confirmed cases or deaths); we shall label these values as zero.
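As a sketch of this calculation with made-up numbers: dividing zero deaths by zero confirmed cases gives NaN in pandas, which we then replace with zero. Replacing infinities as well is a defensive assumption, for the case of deaths recorded against zero confirmed cases.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-02-18", periods=4, freq="D")

# Made-up cumulative figures, for illustration only
confirmed = pd.Series([0, 0, 50, 200], index=dates)
deaths = pd.Series([0, 0, 1, 8], index=dates)

# Deaths as a percentage of confirmed cases; 0/0 is undefined (NaN)
pct = 100.0 * deaths / confirmed

# Label undefined values as zero
pct = pct.replace([np.inf, -np.inf], np.nan).fillna(0.0)

print(pct)
```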

We can plot the percentages for China and Italy. For China it has been between 2% and 4%, although it appears higher in some sections of the time series for Italy. One reason for the difference can be the number of tests conducted relative to the overall population. As of March 3, 2020, the WHO reported a figure of 3.4% globally.

Let's take the last available date for values, and then convert our DataFrame into tidy format (see https://cfss.uchicago.edu/notes/tidy-data/ for a description), which can be utilised more easily by certain Python libraries.
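One way to do this conversion is with pandas' `melt`, which turns one column per country into one row per country. The sketch below uses made-up percentages, and the column names `country` and `death_confirmed_pct` are illustrative choices rather than the notebook's own.

```python
import pandas as pd

dates = pd.date_range("2020-03-01", periods=2, freq="D")

# Made-up percentages in wide format (one column per country)
pct = pd.DataFrame({"Italy": [2.0, 4.0], "China": [3.0, 3.5]}, index=dates)
pct.index.name = "date"

# Take the last available date, keeping a one-row DataFrame
last = pct.iloc[[-1]]

# Wide to tidy: one row per (date, country) observation
tidy = last.reset_index().melt(
    id_vars="date", var_name="country", value_name="death_confirmed_pct"
)

print(tidy)
```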

Let's display the tidy DataFrame.

Plotting cases on a world map

We can now use Plotly Express to create a world map of the death/confirmed ratio. But before we do that, we need to get the iso_codes for each country, which we got earlier from Plotly Express's built-in gapminder dataset.

We now have a DataFrame with the additional iso_alpha column.

We can plot a world map of the death/confirmed percentages. Some might seem very high, because of a very small number of deaths and confirmed cases. Also have a look at https://covidlive.co.uk/ which has a live updated map for UK cases.

Adjusting timelines by lags

Earlier, we showed some plots of the UK and Italy, where the UK was lagged by 14 days, to take into account the relative starts of the outbreaks. Can we do this in a systematic way, rather than arbitrarily trying to pick a lag? One way to do this is to reindex our data, so that the timeline is automatically shifted. The FT has produced some excellent visualisations based on shifts related to:

We'll copy the same approach used by the FT, and other analysis based on this dataset. (Thanks Ewan Kirk for contributing to this!)

Let's now plot the lagged time series by days since 100th confirmed case for a few countries.
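The reindexing by days since the 100th confirmed case can be sketched as below. The helper name `since_threshold` and the input numbers are made up for illustration, and this is one possible implementation of the idea rather than the code contributed to the notebook.

```python
import pandas as pd

dates = pd.date_range("2020-02-20", periods=6, freq="D")

# Made-up cumulative counts, for illustration only
confirmed = pd.DataFrame(
    {
        "Italy": [20, 80, 150, 230, 320, 450],
        "United Kingdom": [5, 10, 30, 90, 120, 200],
    },
    index=dates,
)


def since_threshold(df, threshold=100):
    """Hypothetical helper: re-base each column at its first value >= threshold."""
    aligned = {}
    for country in df.columns:
        series = df[country]
        reached = series >= threshold
        if not reached.any():
            continue  # this country has not reached the threshold yet
        start = reached.idxmax()  # first date at or above the threshold
        # Replace dates by 0, 1, 2, ... days since the threshold was reached
        aligned[country] = series.loc[start:].reset_index(drop=True)
    out = pd.DataFrame(aligned)
    out.index.name = f"Days since {threshold}th case"
    return out


aligned = since_threshold(confirmed, threshold=100)
print(aligned)
```

Countries that crossed the threshold later simply have fewer rows, so the tail of their column is padded with NaN.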

To use a log scale, we just need to add one line.
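With Matplotlib, that one line is typically a call to `set_yscale` on the axes. Here is a sketch with made-up numbers. The `matplotlib.use("Agg")` line is only there so the example runs without a display and can be dropped inside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import pandas as pd

# Made-up counts indexed by days since the 100th case, for illustration only
aligned = pd.DataFrame(
    {
        "Italy": [150, 230, 320, 450],
        "United Kingdom": [120, 200, 310, 460],
    }
)

ax = aligned.plot()
ax.set_yscale("log")  # the one extra line: switch the y-axis to a log scale
ax.set_xlabel("Days since 100th confirmed case")
ax.set_ylabel("Confirmed cases")
```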

We can also plot by confirmed cases/population.

Interactive plots

We created several DataFrames earlier, which should still be in memory, and which are described below.

We need to remove the confirmed and deaths labels from the columns to make them easier to plot interactively.
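Assuming the columns are labelled like "Italy confirmed" and "Italy deaths", stripping the labels can be sketched with a small helper. The name `strip_label` and the numbers are hypothetical and used only for illustration.

```python
import pandas as pd

# Made-up figures, for illustration only
confirmed = pd.DataFrame({"Italy confirmed": [100], "China confirmed": [1000]})
deaths = pd.DataFrame({"Italy deaths": [2], "China deaths": [30]})


def strip_label(df, label):
    """Hypothetical helper: return a copy with ' <label>' removed from column names."""
    out = df.copy()
    out.columns = out.columns.str.replace(f" {label}", "", regex=False)
    return out


confirmed_plain = strip_label(confirmed, "confirmed")
deaths_plain = strip_label(deaths, "deaths")

print(list(confirmed_plain.columns))
```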

In our earlier examples, we created static charts using Matplotlib. However, what if you want to make them interactive, so users can select the countries to examine? We can do that using ipywidgets, which allow us to create interactive sliders, radio buttons etc. in a Jupyter Notebook.

Thanks to Ewan Kirk for providing the code below for using ipywidgets (note that the widgets will only show up if you run the Jupyter notebook live). Given that we don't have population data for every country, the normalized data won't show up for every country.

Let's kick off the interactive widget and plot. Note, that we can change the default list of countries. We only have normalized data for those countries in the OECD (given our database for population data is only for OECD countries).

Change log